Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems

نویسندگان

  • Stefan Büttcher
  • Charles L. A. Clarke
چکیده

We examine index compression techniques for schemaindependent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schemaindependent indices have different characteristics than document-oriented indices, and this difference can affect the effectiveness of index compression algorithms greatly. Our experimental results show that unaligned binary codes that take into account the special properties of schema-independent indices achieve better compression rates than methods designed for compressing document indices and that they can reduce the size of the index by around 15% compared to byte-aligned index compression. Moreover, we present a number of performance-enhancing techniques that may be used to very efficiently decode unaligned codes. Thus, their more compact index representation does not carry the cost of a substantially decreased query processing performance. This contradicts most earlier results on the decoding performance of unaligned codes.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Compression-Domain Text Indexing and Retrieval

Keyword-based text retrieval engines have been and will continue to be essential to text-based information access systems because they serve as the basic building blocks to high-level text analysis systems. Traditionally, text compression and text retrieval are teated as independent problems. Text les are compressed and indexed separately. To answer a keyword-based query, text les are rst uncom...

متن کامل

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: the current research aimed to compare the effectiveness of various tags and codes for retrieving images from the Google. Design/methodology: selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...

متن کامل

Index Compression Using Fixed Binary Codewords

Document retrieval and web search engines index large quantities of text. The static costs associated with storing the index can be traded against dynamic costs associated with using it during query evaluation. Typically, index representations that are effective and obtain good compression tend not to be efficient, in that they require more operations during query processing. In this paper we d...

متن کامل

SASE: Implementation of a Compressed Text Search Engine

Keyword based search engines are the basic building block of text retrieval systems. Higher level systems like content sensitive search engines and knowledgebased systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and Intranet information repositories require efficient mechanisms to store as well as index data. In this...

متن کامل

Toward Remote Object Coherence with Compiled Object Serialization for Distributed Computing with XML Web Services

Cross-platform object-level coherence in Web services-based distributed systems and grids requires lossless serialization to ensure programming-language specific objects are safely transmitted, manipulated, and stored. However, Web services development tools often suffer from lossy forms of XML serialization, which diminishes the usefulness of XML Web services as a competitive approach to binar...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006